{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "WfGpInivS0fG"
   },
   "source": [
    "<h2 align=\"center\">点击下列图标在线运行HanLP</h2>\n",
    "<div align=\"center\">\n",
    "\t<a href=\"https://colab.research.google.com/github/hankcs/HanLP/blob/doc-zh/plugins/hanlp_demo/hanlp_demo/zh/ner_mtl.ipynb\" target=\"_blank\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
    "\t<a href=\"https://mybinder.org/v2/gh/hankcs/HanLP/doc-zh?filepath=plugins%2Fhanlp_demo%2Fhanlp_demo%2Fzh%2Fner_mtl.ipynb\" target=\"_blank\"><img src=\"https://mybinder.org/badge_logo.svg\" alt=\"Open In Binder\"/></a>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "IYwV-UkNNzFp"
   },
   "source": [
    "## 安装"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1Uf_u7ddMhUt",
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "无论是Windows、Linux还是macOS，HanLP的安装只需一句话搞定："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "pp-1KqEOOJ4t",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "!pip install hanlp -U"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "0tmKBu7sNAXX",
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## 加载模型\n",
    "HanLP的工作流程是先加载模型，模型的标示符存储在`hanlp.pretrained`这个包中，按照NLP任务归类。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "EmZDmLn9aGxG",
    "outputId": "38469cbe-d56c-4648-b103-b67e6d22aeff",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'OPEN_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_SMALL_ZH': 'https://file.hankcs.com/hanlp/mtl/open_tok_pos_ner_srl_dep_sdp_con_electra_small_20201223_035557.zip',\n",
       " 'OPEN_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_BASE_ZH': 'https://file.hankcs.com/hanlp/mtl/open_tok_pos_ner_srl_dep_sdp_con_electra_base_20201223_201906.zip',\n",
       " 'CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_SMALL_ZH': 'https://file.hankcs.com/hanlp/mtl/close_tok_pos_ner_srl_dep_sdp_con_electra_small_20210111_124159.zip',\n",
       " 'CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_BASE_ZH': 'https://file.hankcs.com/hanlp/mtl/close_tok_pos_ner_srl_dep_sdp_con_electra_base_20210111_124519.zip',\n",
       " 'CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ERNIE_GRAM_ZH': 'https://file.hankcs.com/hanlp/mtl/close_tok_pos_ner_srl_dep_sdp_con_ernie_gram_base_aug_20210904_145403.zip',\n",
       " 'UD_ONTONOTES_TOK_POS_LEM_FEA_NER_SRL_DEP_SDP_CON_MT5_SMALL': 'https://file.hankcs.com/hanlp/mtl/ud_ontonotes_tok_pos_lem_fea_ner_srl_dep_sdp_con_mt5_small_20210228_123458.zip',\n",
       " 'UD_ONTONOTES_TOK_POS_LEM_FEA_NER_SRL_DEP_SDP_CON_XLMR_BASE': 'https://file.hankcs.com/hanlp/mtl/ud_ontonotes_tok_pos_lem_fea_ner_srl_dep_sdp_con_xlm_base_20210602_211620.zip',\n",
       " 'NPCMJ_UD_KYOTO_TOK_POS_CON_BERT_BASE_CHAR_JA': 'https://file.hankcs.com/hanlp/mtl/npcmj_ud_kyoto_tok_pos_ner_dep_con_srl_bert_base_char_ja_20210914_133742.zip'}"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import hanlp\n",
    "hanlp.pretrained.mtl.ALL # MTL多任务，具体任务见模型名称，语种见名称最后一个字段或相应语料库"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "w0lm87NUsMwW"
   },
   "source": [
    "调用`hanlp.load`进行加载，模型会自动下载到本地缓存。自然语言处理分为许多任务，分词只是最初级的一个。与其每个任务单独创建一个模型，不如利用HanLP的联合模型一次性完成多个任务："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "id": "6Evnxsa0sMwW",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "HanLP = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_BASE_ZH)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "bPUHdNJ-sMwW"
   },
   "source": [
    "## 命名实体识别"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "wxctCigrTKu-"
   },
   "source": [
    "同时执行所有标准的命名实体识别："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Zo08uquCTFSk",
    "outputId": "21be671b-ead0-43c9-cc3a-32c305d8be29"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"tok/fine\": [\n",
      "    [\"2021年\", \"HanLPv2.1\", \"为\", \"生产\", \"环境\", \"带来\", \"次\", \"世代\", \"最\", \"先进\", \"的\", \"多\", \"语种\", \"NLP\", \"技术\", \"。\"],\n",
      "    [\"阿婆主\", \"来到\", \"北京\", \"立方庭\", \"参观\", \"自然\", \"语义\", \"科技\", \"公司\", \"。\"]\n",
      "  ],\n",
      "  \"ner/msra\": [\n",
      "    [[\"2021年\", \"DATE\", 0, 1], [\"HanLPv2.1\", \"WWW\", 1, 2]],\n",
      "    [[\"北京\", \"LOCATION\", 2, 3], [\"立方庭\", \"LOCATION\", 3, 4], [\"自然语义科技公司\", \"ORGANIZATION\", 5, 9]]\n",
      "  ],\n",
      "  \"ner/pku\": [\n",
      "    [],\n",
      "    [[\"北京立方庭\", \"ns\", 2, 4], [\"自然语义科技公司\", \"nt\", 5, 9]]\n",
      "  ],\n",
      "  \"ner/ontonotes\": [\n",
      "    [[\"2021年\", \"DATE\", 0, 1], [\"HanLPv2.1\", \"ORG\", 1, 2]],\n",
      "    [[\"北京立方庭\", \"FAC\", 2, 4], [\"自然语义科技公司\", \"ORG\", 5, 9]]\n",
      "  ]\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "print(HanLP(['2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。', '阿婆主来到北京立方庭参观自然语义科技公司。'], tasks='ner*'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "每个四元组表示`[命名实体, 类型标签, 起始下标, 终止下标]`，下标指的是命名实体在单词数组中的下标，单词数组默认为第一个以`tok`开头的数组。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "cqEWnj_7p2Lf"
   },
   "source": [
    "任务越少，速度越快。如指定仅执行命名实体识别，默认MSRA标准："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 572
    },
    "id": "BqEmDMGGOtk3",
    "outputId": "33790ca9-7013-456f-c1cb-e5ddce90a457"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Token    \tNER Type        \n",
      "─────────\t────────────────\n",
      "2021年    \t───►DATE        \n",
      "HanLPv2.1\t───►WWW         \n",
      "为        \t                \n",
      "生产       \t                \n",
      "环境       \t                \n",
      "带来       \t                \n",
      "次世代      \t───►DATE        \n",
      "最        \t                \n",
      "先进       \t                \n",
      "的        \t                \n",
      "多        \t                \n",
      "语种       \t                \n",
      "NLP      \t                \n",
      "技术       \t                \n",
      "。        \t                \n",
      "阿婆主      \t                \n",
      "来到       \t                \n",
      "北京       \t◄─┐             \n",
      "立方庭      \t◄─┴►ORGANIZATION\n",
      "参观       \t                \n",
      "自然       \t◄─┐             \n",
      "语义       \t  │             \n",
      "科技       \t  ├►ORGANIZATION\n",
      "公司       \t◄─┘             \n",
      "。        \t                \n"
     ]
    }
   ],
   "source": [
    "HanLP('2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。阿婆主来到北京立方庭参观自然语义科技公司。', tasks='ner').pretty_print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jj1Jk-2sPHYx"
   },
   "source": [
    "执行OntoNotes命名实体识别："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 572
    },
    "id": "1goEC7znPNkI",
    "outputId": "2a97331c-a5fb-4d3c-ccf2-ce2186616c57",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Token    \tNER Type\n",
      "─────────\t────────\n",
      "2021年    \t───►DATE\n",
      "HanLPv2.1\t───►ORG \n",
      "为        \t        \n",
      "生产       \t        \n",
      "环境       \t        \n",
      "带来       \t        \n",
      "次世代      \t        \n",
      "最        \t        \n",
      "先进       \t        \n",
      "的        \t        \n",
      "多        \t        \n",
      "语种       \t        \n",
      "NLP      \t        \n",
      "技术       \t        \n",
      "。        \t        \n",
      "阿婆主      \t        \n",
      "来到       \t        \n",
      "北京       \t◄─┐     \n",
      "立方庭      \t◄─┴►ORG \n",
      "参观       \t        \n",
      "自然       \t◄─┐     \n",
      "语义       \t  │     \n",
      "科技       \t  ├►ORG \n",
      "公司       \t◄─┘     \n",
      "。        \t        \n"
     ]
    }
   ],
   "source": [
    "HanLP('2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。阿婆主来到北京立方庭参观自然语义科技公司。', tasks='ner/ontonotes').pretty_print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 注意\n",
    "Native API的输入单位限定为句子，需使用[多语种分句模型](https://github.com/hankcs/HanLP/blob/master/plugins/hanlp_demo/hanlp_demo/sent_split.py)或[基于规则的分句函数](https://github.com/hankcs/HanLP/blob/master/hanlp/utils/rules.py#L19)先行分句。RESTful同时支持全文、句子、已分词的句子。除此之外，RESTful和native两种API的语义设计完全一致，用户可以无缝互换。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "P7CNTDBRsiYa"
   },
   "source": [
    "## 自定义词典"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ZXtRTXlBsmtw"
   },
   "source": [
    "自定义词典是NER任务的成员变量，要操作自定义词典，先获取一个NER任务。以MSRA为例："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "id": "QgY22h0AszsA"
   },
   "outputs": [],
   "source": [
    "ner = HanLP['ner/msra']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_6fPzuyps98H"
   },
   "source": [
    "### 白名单词典\n",
    "白名单词典中的词语会尽量被输出。当然，HanLP以统计为主，词典的优先级很低。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 321
    },
    "id": "plNDyWhws5qg",
    "outputId": "7120d400-022c-42e9-fca9-febe3745d2c9"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Token\tNER Type   \n",
      "─────\t───────────\n",
      "2021年\t───►DATE   \n",
      "测试   \t           \n",
      "高血压  \t           \n",
      "是    \t           \n",
      "138  \t───►INTEGER\n",
      "，    \t           \n",
      "时间   \t           \n",
      "是    \t           \n",
      "午饭   \t◄─┐        \n",
      "后    \t◄─┴►TIME   \n",
      "2点45 \t───►TIME   \n",
      "，    \t           \n",
      "低血压  \t           \n",
      "是    \t           \n",
      "44   \t───►INTEGER\n"
     ]
    }
   ],
   "source": [
    "ner.dict_whitelist = {'午饭后': 'TIME'}\n",
    "doc = HanLP('2021年测试高血压是138，时间是午饭后2点45，低血压是44', tasks='ner/msra')\n",
    "doc.pretty_print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "aR_8TICmtw_E"
   },
   "source": [
    "### 强制词典\n",
    "如果你读过[《自然语言处理入门》](http://nlp.hankcs.com/book.php)，你就会理解BMESO标注集，于是你可以直接干预统计模型预测的标签，拿到最高优先级的权限。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 268
    },
    "id": "sWPljj3stsEA",
    "outputId": "99c4c281-a5b6-46bb-dffd-c1722fee7aee"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "To\tNER Type    \n",
      "──\t────────────\n",
      "他 \t            \n",
      "在 \t            \n",
      "浙江\t───►LOCATION\n",
      "金华\t───►LOCATION\n",
      "出生\t            \n",
      "， \t            \n",
      "他 \t            \n",
      "的 \t            \n",
      "名字\t            \n",
      "叫 \t            \n",
      "金华\t───►PERSON  \n",
      "。 \t            \n"
     ]
    }
   ],
   "source": [
    "ner.dict_tags = {('名字', '叫', '金华'): ('O', 'O', 'S-PERSON')}\n",
    "HanLP('他在浙江金华出生，他的名字叫金华。', tasks='ner/msra').pretty_print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "fkTC0GFxtinZ"
   },
   "source": [
    "### 黑名单词典\n",
    "黑名单中的词语绝对不会被当做命名实体。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 268
    },
    "id": "bIJpgdGauLJK",
    "outputId": "e74ec7ba-00fd-4958-d772-a1d1c40d1033"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "To\tNER Type    \n",
      "──\t────────────\n",
      "他 \t            \n",
      "在 \t            \n",
      "浙江\t───►LOCATION\n",
      "金华\t            \n",
      "出生\t            \n",
      "， \t            \n",
      "他 \t            \n",
      "的 \t            \n",
      "名字\t            \n",
      "叫 \t            \n",
      "金华\t            \n",
      "。 \t            \n"
     ]
    }
   ],
   "source": [
    "ner.dict_blacklist = {'金华'}\n",
    "HanLP('他在浙江金华出生，他的名字叫金华。', tasks='ner/msra').pretty_print()"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "collapsed_sections": [],
   "name": "ner_mtl.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
